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Abstract 

The control, prediction, and understanding of epidemiological processes require insight into how infectious pathogens 
transmit in a population. The chain of transmission can in principle be reconstructed with phylogenetic methods which 
analyze the evolutionary history using pathogen sequence data. The quality of the reconstruction, however, crucially 
depends on the underlying epidemiological model used in phylogenetic inference. Until now, only simple epidemiological 
models have been used, which make limiting assumptions such as constant rate parameters, infinite total population size, 
or deterministically changing population size of infected individuals. Here, we present a novel phylogenetic method to 
infer parameters based on a classical stochastic epidemiological model. Specifically, we use the susceptible-infected- 
susceptible model, which accounts for density-dependent transmission rates and finite total population size, leading to a 
stochastically changing infected population size. We first validate our method by estimating epidemic parameters for 
simulated data and then apply it to transmission clusters from the Swiss HIV epidemic. Our estimates of the basic 
reproductive number R 0 for the considered Swiss HIV transmission clusters are significantly higher than previous 
estimates, which were derived assuming infinite population size. This difference in key parameter estimates highlights 
the importance of careful model choice when doing phylogenetic inference. In summary, this article presents the first 
fully stochastic implementation of a classical epidemiological model for phylogenetic inference and thereby addresses a 
key aspect in ongoing efforts to merge phylogenetics and epidemiology. 
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Introduction 

Phylogenetics is becoming increasingly popular thanks to a 
large availability of genetic sequence information, and conse- 
quently phylogenetic methods have successfully been applied 
to pathogen sequence data in order to obtain estimates of 
transmission and death rates (Pybus et al. 2001; Rasmussen 
et al. 201 1; Stadler et al. 2012). The basis of these phylogenetic 
methods is the evolutionary tree reconstructed from the sam- 
pled pathogen population (see Pybus and Rambaut [2009] 
and references therein). If the evolutionary rate of the 
pathogen is sufficiently high such that the evolutionary and 
epidemiological timescales are similar, then the evolutionary 
trees can give insight into the transmission dynamics of 
the disease (Pybus et al. 2001; Drummond et al. 2003; 
Grenfell et al. 2004). These phylogenetic methods are of 
statistical nature and assume an underlying model that 
describes both the evolutionary and the population dynamics 
of the genetic sequences (Pybus and Rambaut 2009). The 
choice of phylogenetic method thus comes with model 
assumptions that may or may not be appropriate to a specific 
question. Phylogenetic methods are therefore susceptible to 
model misspecifkation that can lead to incorrect parameter 
estimates. 

Models from mathematical epidemiology that describe the 
spread of a disease in a population are well established 



(Kermack and McKendrick 1927; Anderson and May 1991). 
These models are quite different in character from population 
models traditionally used in phylogenetic inference. There 
has been great effort to extend previous phylogenetic meth- 
ods to account for the particularities that accompany the 
population dynamics of infectious diseases (Volz et al. 2009; 
Frost and Volz 2010, 2012; Rasmussen et al. 2011). One of the 
important aspects of many epidemiological models is that 
they account for saturation effects in the number of infected 
individuals, meaning that transmission decreases as the pool 
of susceptible individuals is depleted. 

Most of the methods that infer epidemiological parame- 
ters from phylogenetic trees assume a population model that 
is based on the coalescent, which in its original formulation 
makes strong limiting assumptions (Kingman 1982). To over- 
come some of these limitations, there has been a range of 
work that has extended coalescent theory to incorporate 
more complex epidemiological models (Volz et al. 2009; 
Frost and Volz 2010; Koelle and Rasmussen 2011; Volz 
2012). Yet, these extensions all assume a deterministically 
changing population size. Stochastic effects, however, have 
been shown to be of importance when considering epidemi- 
ological processes (Rohani et al. 2002). 

More recently, Rasmussen et al. (2011) have used coales- 
cent models and sequential Monte Carlo methods together 
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with stochastic ordinary differential equations (ODEs) to infer 
parameters of the epidemiological process. Sequential Monte 
Carlo methods such as particle filters are powerful statistical 
tools that can approximate the likelihood exactly but come at 
a large computational cost (Wilkinson 2011). 

Alternatively, methods based on the birth-death model 
can be used for phylogenetic inference (Kendall 1948; 
Nee 2006; Stadler 2010). The birth-death model is a discrete 
stochastic description of the process governing the popula- 
tion dynamics. Phylogenetic trees are then produced by 
adding a sampling process to the birth-death model. 
Birth-death models naturally account for the stochasticity 
of the population size in epidemiological models and allow 
for the estimation of epidemiological parameters when the 
assumptions of the coalescent are not justified. 

The birth-death model has recently been extended to 
account for saturation effects in the case of species evolution, 
which has allowed for a much better fit to the number of 
lineages over time in small species trees (Etienne et al. 2012). 
This method cannot be directly applied to viral phylogenies 
as it requires all sequences to be sampled at a single point 
in time, a condition that is rarely satisfied for disease surveil- 
lance data. Furthermore, the method requires the solution 
of a series of high-dimensional initial value problems, which 
is computationally challenging and has only been success- 
fully performed for population sizes of the order of tens 
of species. 

Here, we present a new method for phylogenetic inference 
of epidemiological parameters which is based on the birth- 
death model. Our method accounts for both sequentially 
sampled genetic data (Stadler 2010) and saturation effects 
(Etienne et al. 2012). In particular, we estimate transmission 
and death rates, as well as the susceptible population size 
from sequence data under a stochastic susceptible-infected- 
susceptible (SIS) model. The SIS model is the standard model 
used to describe the spread of sexually transmitted diseases 
without immunity (Anderson and May 1991). We derive 
an expression for the likelihood of a transmission tree, 
which can then be used to estimate the model parameters 
in either a maximum likelihood (ML) or Bayesian framework 
(Drummond et al. 2002, 2005). We use a recently developed 
method to calculate matrix exponentials (Al-Mohy and 
Higham 2011) in order to solve the high-dimensional initial 
value problems required to compute the likelihood. Our 
method can calculate the likelihood of a single tree for pop- 
ulation sizes of the order of 10,000 individuals. Using estimates 
for the epidemiological rates and the susceptible population 
size, we calculate the basic reproductive number, R 0 
(Anderson and May 1979). We validate our method by rees- 
timating parameters from simulated data using an SIS model. 
We then estimate epidemiological rates, the susceptible pop- 
ulation size, and R 0 for ten transmission clusters of the Swiss 
HIV epidemic (Kouyos et al. 2010; Schoeni-Affolter et al. 2010; 
Stadler et al. 2012). 

New Approaches 

In this section, we will give a brief summary of the new 
method used and refer to the supplementary information, 



Supplementary Material online, for a more detailed descrip- 
tion. In short, we calculate the likelihood that the observed 
phylogeny is a realization of a stochastic SIS model (see 
Materials and Methods). 

We assume a susceptible-infected (SI) model with constant 
total population size N as a model for transmission (see 
Materials and Methods). An outbreak begins with a single 
infected individual and the disease is transmitted with rate 
/3/N to susceptible individuals. Infected individuals are 
removed either through "death" with rate /j. or through sam- 
pling with rate xj/. Sampling corresponds to the case where 
individuals are sequenced (e.g., prior to treatment) and 
become noninfectious thereafter, for example, due to success- 
ful treatment or behavior change (Stadler et al. 2013). We 
assume that a removed individual is immediately replaced by 
a new susceptible individual, resulting in a constant popula- 
tion size N. Under this assumption, the SI model is equivalent 
to an SIS model. 

Based on a sampled phylogeny of an epidemic outbreak 
(see Materials and Methods), we derived an expression for the 
likelihood of a phylogenetic tree under an SI/SIS model. In the 
following, time will always be measured going backward from 
the present t 0 into the past. As with serial sampling in the 
coalescent framework (Drummond et al. 2002), we can split 
up a phylogenetic tree into time intervals between sampling 
times x and branching times y. During these time intervals, 
the number of lineages in the tree is constant, but it increases 
by 1 at a sampling event and decreases by 1 at a branching 
event (see fig. 1). We introduce the probability p,(\\ t), which 
is the probability that within the i-th time interval, exactly / 
infected individuals at time t gave rise to the phylogeny 
observed between that time t and the present time t 0 . 
Thus, for every /' and t, we can write the probabilities that 
0,1 ,2, . . . ,N infected individuals gave rise to the phylogeny as 
a vector p ; (t) = (p,(0; t),p,(1; t), . . . ,p,(N; t)). The time 
evolution of this vector of probabilities p f (t) is then governed 
by a birth-death master equation (see supplementary meth- 
ods, Supplementary Material online). The master equation 
translates to a system of linear ODEs for p,(t) within the 
i-th time interval, 

dPiU; t) 

-^-^ = - /(MO + fi + t)PiO\ 0 + (/ + <c/)A(/)p,(/ + 1 ; 0 
at 

+ (/ - k;)Atp;(/ - 1 ; t). 

0) 

Here, k t are the number of tree lineages during the i-th 
interval, /x is the death rate of individuals, x// is the sampling 
rate, and = /3(N — l)/N is the rate at which new 
infections occur in the SIS model. The above equation is 
linear in the p,(\\ t) and its solution can thus be written in 
matrix form as, 

Pit) = e^'VtM)- (2) 

The tridiagonal matrix C, contains all the information of 
the ODEs within the i-th time interval. This allows us to 
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Fic. 1. Example of an epidemic with sampled (red) and unsampled (gray) individuals. The top panel shows the infective periods of all individuals in 
the epidemic. The middle panel shows the infective periods of only the sampled individuals as well as the recreated transmission tree. The red dots are 
the sampling times of the individuals and the black triangles the branching times on the sampled phylogeny. Note that while we do not know the 
exact infectious periods of the sampled individuals, the transmission chain between two events can pass through multiple individuals. The bottom 
panel shows the corresponding phylogenetic tree with branching times x, and sampling times y y In this example, the joint event time vector is 
t = (0,y 1 ,y 2 ,y3,x 1 ,X2,y4,X3,x 4 ). The axis at the bottom of the figure shows how the matrices in equation (5) are applied to the probability vector p,(t) 
from the present time t 0 to the root of the tree t 2n (here n = 4 and t 8 = X4). The numbers along the axis are the number of extant lineages within 
that time interval. Note that the C, matrices are applied as matrix exponentials, i.e., p t (t + h) = e c,h Pi(t). The likelihood of the tree given that a single 
infected individual started the epidemic is then the entry of the vector of probabilities at the root, i.e., p 7 (t 7 ) for which the number of infected 
individuals / = 1. 



calculate p,(t,) at the end of the /-th time interval given the 
value ofpj(t,-_i) at the beginning of the /-th time interval. 

At the end of each time interval, the number of tree lin- 
eages k; either decreases by 1 at branching events or increases 
by 1 at sampling events. At a branching event, the number 
of tree lineages decreases by 1, so that the vector of 
probabilities has to be shifted and multiplied with the instan- 
taneous probability that a branching event occurred, 

p / (/;t,_ 1 ) = 2X(/)p;_ 1 (/-M;t,_ 1 ). (3) 

The factor 2 indicates that either one of the two descendants 
of the branching event may be the donor-infected individual. 
At a sampling event, the number of tree lineages increases 
by 1 and the vector of probabilities is multiplied by the 
sampling rate xjs, 

p / (/;t,_ 1 ) = V f P/-i('-1;t/-i)- (4) 

These shifts of the vector of probabilities can be summarized 
in a matrix D,. We can find the vector of probabilities at the 
root of the tree by iteratively applying e Cj and D, until we 



reach the root at time t 2n . This gives us the likelihood of the 
tree at the root, 

/CD = P2n(t2n) = (j[ D^'-^Wo). (5) 

If we assume that the epidemic was started by a single indi- 
vidual at time t 2n , then the likelihood of the tree C(T; 9) is 
the entry of \(T) for which the number of infected individuals 
/ = 1 and 9 represents the model parameters N,/3,[i, and x//. 
The vector of initial conditions (t 0 ) can be chosen to reflect 
prior knowledge on disease prevalence at the present time. In 
the absence of any prior information, all prevalence levels 
are equally likely, p^{t 0 ) = (1,1, . . . ,1). Finally, we condition 
the likelihood C(T; 9) on sampling at least one infected 
individual throughout the epidemic, C 0 {9) (see supplemen- 
tary information, Supplementary Material online). We use 
the conditioned likelihood C(T; 9)/C 0 (6) to estimate 
epidemiological parameters from the sampled phylogenies. 

The computation of the above likelihood using traditional 
matrix multiplication methods suffers from either poor 
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accuracy or computational intractability. To get around this, 
we use a novel method that can accurately and efficiently 
calculate matrix exponentials such as in equation (5) 
(Al-Mohy and Higham 2011). Our method is available both 
as an R package expoTree and C function (Leventhal 2013). 

Results 

Model Validation 

We test the validity of our model by simulating 1,000 trans- 
mission trees under 11 different parameter combinations 
using a stochastic SIS model with various population sizes 
N, recovery rates fi, and sampling rates iff until n samples 
are taken. The infection rate = 1 in all cases, because 
it is always possible to rescale time such that fi = 1 without 
loss of generality. In the density-independent (BD) case, 
sampling and death rates cannot be independently estimated 
(Stadler et al. 2013). Because no closed-form solution for 
the likelihood is available in the density-dependent case, 
it remains unclear whether all four parameters of the 
density-dependent model are independent and can be esti- 
mated separately (see supplementary methods, section A.7, 
Supplementary Material online). To account for potential 
dependence among the parameters in the density-dependent 
model, we estimate parameters for both an unconstrained 
(DD) and fixed (DDfj xed ) value of the sampling probability, 
r = ^- 

We then reestimate the parameters for each parameter 
combination in two ways, illustrating two alternative 
applications: 

(A) We reestimate the parameters for each of the transmis- 
sion trees using the ML estimator. This corresponds 
to the case when only a single sampled phylogeny has 
been reconstructed from the sequence data. The 1,000 
parameter estimates can be interpreted as a parametric 
bootstrap for the confidence interval around the true 
parameter value. 

(B) We estimate the posterior distribution of the parameters 
for trees #500-590 of the 1,000 simulated transmission 
trees jointly in a Bayesian framework (see Materials and 
Methods). This corresponds to the case when multiple 
independent trees are available with the same underlying 
stochastic process, for example, independent samples 
from the posterior distribution of trees estimated using 
Bayesian Markov chain Monte-Carlo (MCMC) estima- 
tion as in BEAST (Drummond and Rambaut 2007). 

Method A 

The bootstrapped confidence intervals for both the 
DDfixed and DD model all contain the input parameters (sup- 
plementary fig. S1, tables S1 and S2, Supplementary Material 
online). 

The confidence intervals of the DD model fully contain the 
confidence intervals of the DDfj xe d model, as expected (except 
for parameter sets 7 and 9, though the nonoverlap is very 
small). This is in agreement with previously reported results 
for the density-independent birth-death model, where /x and 
t/r only appear as the sum [i + ifr in the expression for Rq, 



and the explicit choice of [i given ijr does not change the 
value of R 0 . 

Method 8 

The bias of the posterior mean estimates of the parameters 
for all 11 parameter sets is listed in table 1. The posterior 
distribution could not be estimated for parameter sets 2 
and 4 in the DD model and for 6 and 7 in both the DD 
model and DDfj Xe d models, because the Markov chains did 
not reach a steady state. For these parameter sets, we report 
ML estimates for the joint likelihood of the subset of trees. In 
all cases, the density-dependent models provided a better fit 
to the data on the basis of the deviance information criterion 
(DIC). For the cases where posterior distributions could not 
be estimated, we calculated the deviance at the ML estimate. 
These cannot be directly compared with DIC values, as no 
correction for complexity has been performed. However, for 
those cases where DIC values could be computed, the effec- 
tive number of parameters was generally around 2-4 for the 
DDfj X ed model and 4-8 for the DD model. We therefore do 
not expect the differences in deviance on the order of 10 3 
between the BD and DD models for the parameter sets 6 and 
7 to be predominantly due to added complexity. 

In the DDfj X ed case, the absolute bias of the estimated 
parameters is generally less than 5%, with the exception of 
parameter set 3, where the bias on N was = —0.05 and 
ifjP = 0.06 on fi. For this parameter set, the true values 
of N,fi, and Rq, as well as for R 0 in parameter set 8, fall 
marginally outside the 95% highest probability density 
(HPD) interval (see supplementary tables S3 and S4, 
Supplementary Material online). However, when inferring 
parameters using different sets of simulated trees, the 
95% HPD intervals contain the true parameter (data not 
shown). 

Estimation bias using the posterior mean was generally 
larger in the DD model compared with the DDfj xe d model, 
though the credible intervals were comparatively large and 
all contained the true parameters, with the exception of x/r 
and r in parameter set 1. For those parameter sets where 
iff < (jl (sets 3, 4, 6, 7, 10, and 11), the bias of the posterior 
mean estimates was generally small. 

In most cases, the density-independent model did not 
return a posterior HPD interval that contained any of the 
input parameters, and the posterior means were heavily 
biased. In the remaining cases, this model was able to correctly 
estimate [i and if/ but strongly underestimated fi. The bias 
is strongest when N is small (i.e., large saturation effects), 
because fi is estimated as a time-averaged value of the 
force of infection A(/) = ,6(1 — l/N) (see Materials and 
Methods), and this time average is smaller than the actual 
fi. This is of particular importance when estimating 
R 0 = fi/(fi + \jr), where strongly underestimating fi will 
result in a strongly underestimated R 0 . 

We obtain a visual confirmation of the superior fit of the 
DD and DDfj Xe d models by plotting the lineages-through-time 
(LTT) of the simulated trees together with the LTT predicted 
by the estimated parameters (supplementary fig. S2, 
Supplementary Material online). The density-dependent 
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Table 1. The Relative Bias of the Estimated Parameters and DIC Values of the Fit. 



Tree 


Method 


N 


P 






Ro 


DIC 


Pv 


1 (n = 100) 


SIM 


100 


1 


0.1 


0.4 


2 










-0.02 


0.01 


-0.01 


-0.01 


0.02 


39,106 


3.15 




*DD 


0.41 


0.31 


3.23 




|-0.10 


38,822 


6.67 




BDfixed 


— 




|-0.02 


-0.02 




| 39,291 


1.84 


2 (n = 100) 


SIM 


500 


1 


0.1 


0.4 


2 








*DD fixed 


-0.04 


0.01 


-0.02 


-0.02 


0.03 


30,475 


3.06 




DD b 




— 






— 


— 






BDf, xed 














2.26 


3 (n = 100) 


SIM 


100 


1 


0.4 


0. 1 


2 








*DD fixed 






|-0.01 


0.01 




^| 70,059 


2.74 




DD 


0.11 


0.12 


0.15 


-0.09 


0.02 


70,064 


6.22 




BDfixed 


- 




|-0.01 


-0.01 




| 71,262 


2.19 


4 (n = 100) 


SIM 


500 


1 


0.4 


0.1 


2 










-0.04 


-0.00 


-0.00 


-0.00 


-0.00 


53,637 


3.18 




*DD C 


-0.09 


-0.03 


-0.07 


0.03 


0.02 


53,630" 


— 




BDfixed 


— 












2.06 


5 (n = 1000) 


SIM 


100 


1 


0.1 


0.4 


2 








*DD fixed 


0.01 


0.00 


0.00 


0.00 


0.01 


512,449 


2.89 




DD 


0.01 


0.02 


0.15 


-0.02 


0.00 


512,454 


6.04 




BD fixed 






|-0.00 


-0.00 




51 5,752 


1.87 


6 (n = 100) 


SIM 


1000 


1 


0.495 


0.005 


2 








*DD fixed c 


-0.00 


-0.01 


-0.01 


-0.01 


0.00 


110,267" 


— 




DD C 


0.03 


0.00 


0.01 


-0.03 


-0.01 


110,267* 


— 




BD fixed 


— 












1.84 


7 (n = 100) 


SIM 


10000 


1 


0.495 


0.005 


2 








DD fixed c 


0.04 


0.02 


0.03 


0.03 


0.02 


70,851" 


— 




*DD C 


0.04 


-0.02 


-0.04 


-0.03 


0.02 


70,851" 


— 




BDfj xed 


- 




|-0.03 


-0.03 




| 72,010 


2.26 


8 (n = 100) 


SIM 


100 


1 


0.02 


0.08 


10 








*DDfixed 


0.01 


0.02 


-0.02 


-0.02 




| 73,229 


2.63 




DD 


0.02 


0.01 


0.30 


0.01 


0.07 


73,231 


4.46 




BD fixed 


- 












2.09 


9 (n = 100) 


SIM 


500 


1 


0.02 


0.08 


10 








*DD fixed 


-0.05 


0.01 


0.04 


0.04 


-0.03 


45,025 


2.71 




DD 


-0.03 


0.02 


0.64 


0.03 


-0.08 


45,029 


5.7 




BDfi xed 


- 




|-0.03 


-0.03 




| 46,647 


2.13 


10 (n = 100) 


SIM 


100 


1 


0.08 


0.02 


10 








*DD fi xed 


-0.00 


-0.01 


-0.00 


-0.00 


-0.00 


112,519 


3.37 




DD 


0.03 


0.00 


0.03 


0.02 


0.02 


112,525 


7.42 




BD fixed 














1.99 


11 (n = 100) 


SIM 


500 


1 


0.08 


0.02 


10 








*DD fixed 


0.04 


-0.02 


-0.03 


-0.03 


0.01 


70,854 


3.06 




DD 


0.06 


-0.01 


0.00 


-0.03 


0.01 


70,859 


6.23 




BD fixed 














1.98 



Note. — The SIM entries are the input parameters to the simulation. DDfj Xe d, density-dependent model with fixed sampling probability; DD, density-dependent model with inferred 
sampling probability; BDfi xe( j, density- independent model with fixed sampling probability, n is the number of tips in the tree. The entries at N,f3,(A,$, and R 0 show the relative 
bias of the estimates. Smaller DIC values indicate a better fit of the model to the data. The model with the smallest DIC value is indicated by an asterisk for each of the 
parameter sets. 

a The HPD interval does not contain the true parameter value (shaded cells). 
b Numerical maximization of the likelihood failed. 

c The MCMC method did not converge under the Gelman-Rubin diagnostic. We therefore report ML point estimates and the deviance at the ML estimator. This is not 
equivalent to a DIC, but must be corrected by 2pv, where p v is the effective number of parameters. 
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models produce LTTs that more accurately reproduce 
the LTTs of the input trees compared with the density- 
independent model, especially when N is small. This is 
most pronounced when the number of sampled individuals 
is large compared with the total population size (n > N), 
indicating that the epidemic has been in the endemic 
equilibrium. 

Comparison with Logistic Coalescent 
In the parametric coalescent, the decrease in effective popu- 
lation size backward in time from an initial value N 0 is 
described by an arbitrary function (Griffiths and Tavare 
1994). In the case of an SIS model, the solution to the 
deterministic equations is a logistic function. We therefore 
compare our method with a coalescent model with popula- 
tion size governed by a logistic function (see supplementary 
methods, section A.8, Supplementary Material online), 



The parameters of the logistic function are related to the 
parameters of the SIS model by p = p — (ji + i/r) and 
cp = N/M 0l where N is the carrying capacity of the model 
(i.e., the total population size in the SIS model). The per lin- 
eage coalescent rate in calendar time is i? -1 = (fi + xj/)/M 0 
(supplementary methods, section A.8, Supplementary 
Material online). The three parameters of the logistic coales- 
cent are therefore linked to four parameters of the SIS model 
and A/l 0 . The input parameters of the trees simulated under 
the SIS model in Model Validation section can therefore only 
be converted to (p and ■& of the logistic coalescent by an 
appropriate choice of M 0 . The growth rate p can be converted 
independent of the choice of M 0 . In order to compare the 
parameter estimates from the logistic coalescent to our 
model, we choose M 0 as the total infected population size 
in a deterministic SIS model that started with a single infected 
individual after a time i 2 „ has passed (supplementary meth- 
ods, equation A.6.4, Supplementary Material online), where 
t 2n is equal to the mean height of the 1,000 simulated trees. 
For all parameter sets, the ML estimator is either heavily 
biased or the 95% confidence intervals contain the input 
parameter but are extremely large (supplementary table S5, 
Supplementary Material online). Bayesian MCMC was 
generally not possible, as the Markov chains never reached 
a steady state. 

HIV 

We applied both the DD and DDf; xe d methods to ten trans- 
mission clusters of the Swiss HIV Cohort Study (SHCS) 
(Kouyos et al. 2010; Schoeni-Affolter et al. 2010). For each 
of the clusters, 90 trees were sampled from the posterior 
distribution of trees previously determined using BEAST 
(Drummond and Rambaut 2007; Stadler et al. 2012). The 
90 trees were chosen from the MCMC chain, such that 
they can be considered independent identically distributed 
(iid) samples from the posterior distribution of trees. We then 
estimated the posterior distribution of parameters using 



method B (see Model Validation and Materials and 
Methods). In our model, "death" corresponds to any process 
resulting in an individual becoming noninfectious, and a 
"dead" individual is assumed to be replaced by a susceptible 
individual. In the case of the SHCS, individuals mainly become 
noninfectious when they start anti retroviral treatment, upon 
which their viral load is suppressed and consequently also 
onward transmission. 

Of all the patients who start treatment in Switzerland, 
around 75% are included in the SHCS (Schoeni-Affolter 
et al. 2010). This corresponds to the sampling probability, 
r = i/ f /( 1 A + m)- To account for other reasons for becoming 
noninfectious, we fix the sampling probability in the DDfj xec | 
model at three different values of r = 0.25, 0.5, and 0.75. 

To test for potential bias in the estimates, we simulated 
transmission trees under the SIS model using the estimated 
values for each of the clusters and subsequently reinferred the 
model parameters for the simulated data. We did not find any 
evidence of estimator bias for the DDfj xe d model. For the DD 
model, however, we detected nonnegligible bias of the 
estimators. We therefore only report the estimated parame- 
ters for the DDfj xe d model with r = 0.75 in table 2 and for 
r = 0.25 and r = 0.5 in the supplementary materials, 
Supplementary Material online. 

For clusters 3 and 10, the posterior of N was bounded on 
top by the upper limit of the uniform prior and the MCMC 
chain did not converge. This indicates that these clusters have 
a very large total population size and are more appropriately 
estimated using the density-independent model (supplemen- 
tary table S8, Supplementary Material online). 

In the remaining eight clusters, the population size esti- 
mates vary from small, that is, same order of magnitude as the 
number of sampled individuals (e.g. N (8) = 14 = n), to large, 
that is, N > > n (e.g., N (6) 6 [61.9,1780]). This indicates that 
there are some clusters where the epidemic has saturated 
and only few new infections are occurring (N % n), but also 
other clusters where new infections are still common 
(N >> n). Similarly, estimates of R 0 range from moderate 
(e.g., R ( o } = 2.80) to very large (e.g., = 13.7). 

Using the estimated parameters, we plot the LTT as well 
as the inferred prevalence within the transmission clusters. 
Two example clusters are shown in figure 2 and all the clusters 
in supplementary figures S3 and S4, Supplementary Material 
online. Visual inspection of these LTT plots confirms that the 
density-dependent model replicates the distribution of phy- 
logenetic trees better than the density-independent model, 
with the exception of clusters 3 and 10. Interestingly, for 
cluster 3, neither model is able to adequately produce an 
acceptable LTT plot, suggesting that an SIS model is not an 
appropriate representation of this transmission or that the 
inferred phylogenetic tree is questionable. 

Discussion 

This study proposes a method that extends previous work on 
birth-death models for phylogenetic inference (Etienne et al. 
2012; Stadler et al. 2012) and combines both density depen- 
dence and longitudinal sampling into a single framework. 
Previous methods based on the coalescent that can infer 
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Table 2. Epidemiological Parameter Estimates for the 10 Swiss HIV Transmission Clusters Under the DDn XK [ Model with r - 


= 0.75. 


Cluster 


i a 


2 a 


3 


4 a 


5 a 


n 


34 


29 


27 


26 


25 


N 


188 [160,218] 


36.1 [32.2,40.6] 


- 


39.4 [35.5,43.9] 


65.7 [58.7,73] 


P 


0.269 [0.258,0.282] 


0.5 [0.458,0.543] 


— 


1.02 [0.952,1.08] 


0.622 [0.594,0.658] 


f 


0.00973 [0.009,0.011] 


0.0341 [0.032,0.036] 


— 


0.0808 [0.076,0.086] 


0.0113 [0.01,0.012] 


'I' 


0.0292 [0.027,0.032] 


0.102 [0.096,0.108] 




0.243 [0.227,0.258] 


0.034 [0.031,0.037] 


Ro 


6.93 [6.29,7.62] 


3.67 [3.39,3.94] 




3.15 [2.88,3.4] 


13.7 [12.4,15.3] 


Cluster 


6 a 


7 a 


8 a 


9 a 


10 


n 


18 


17 


14 


14 


12 


N 


134 [70.5,1780] 


175 [128,256] 


14 [14,14.4] 


63.8 [53,76] 




IS 


0.442 [0.39,0.47] 


0.396 [0.372,0.42] 


0.916 [0.842,0.983] 0.672 [0.623,0.712] — 


f 


0.0394 [0.035,0.042] 


0.0154 [0.014,0.017] 


0.0427 [0.04,0.045] 0.0196 [0.017,0.023] — 


>!' 


0.118 [0.106,0.127] 


0.0463 [0.041,0.051] 


0.128 [0.119,0.135] 0.0587 [0.051,0.068] — 


Ro 


2.8 [2.37,3.06] 


6.42 [5.59,7.36] 


5.37 [4.76,5.98] 8.59 [7.17,9.93; 





Note. — n is the number of sampled individuals in each subepidemic. R 0 — + i/r) is the basic reproductive ratio. The reported values are the posterior mode and the 95% 
credible intervals from the posterior distribution. For the uniform prior used, the posterior mode corresponds to the ML estimator and only differed negligibly for the estimate of 
N in cluster 6. In cluster 6, the posterior distribution of N was heavy-tailed and bounded by the uniform prior N e [18,2000]. Therefore, the upper limit of the credible interval is 
likely an underestimate. 

a The fit of the density-dependent model is significantly better than the density-independent model. Model comparison is based on DIC values. 




1975 1985 1995 2005 1980 1990 2000 2010 2020 

Year 

Fig. 2. Lineage through time plots and prevalence curves of two example HIV subepidemics in Switzerland. Left panels: The gray lines are the LTT of 
the 90 samples from the posterior distribution of trees estimated using BEAST. The solid and dashed lines are the expected number of lineages for 
the density-dependent and density-independent SIS models, respectively, using the parameter estimates from table 2. Predicted LTT plots are 
almost identical when using parameter estimates for sampling probabilities r = 0.25 and r = 0.5. Right panels: Dashed lines correspond to the 
density-independent (BD) model and solid lines to the density-dependent (DD) model. The vertical dotted line indicates the time of the last sample 
in the tree. The gray steps are the actual cumulative number of sampled individuals over time and the red curves are the fitted functions. The black 
lines show the predicted prevalence from the fitted model. The predicted number of infected individuals (black) and cumulative number of sampled 
individuals (red) for the estimated parameter values. Although both model produce acceptable fits to the cumulative number of samples over time, 
the BD model predicts both the prevalence and cumulative number of samples to increase exponentially in the future, whereas the DD model can 
identify subepidemics that are already in the saturated phase. 
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density-dependent transmission rates rely on a deterministi- 
cally changing population size and cross-sectionally sampled 
data (Volz et al. 2009; Frost and Volz 2010; Volz 2012). Other 
methods based on birth-death models account for longitu- 
dinally sampled sequences and stochasticity but neglect 
density-dependent transmission (Stadler et al. 2012). 
Alternatively, semiparametric methods such as the various 
skyline-plot models (Pybus et al. 2001; Ho and Shapiro 
2011; Stadler et al. 2013) allow for transmission rates that 
change over time. These methods, however, require a parti- 
tioning of the past into periods of constant population size. 
Each of these partitions then requires an explicit transmission 
rate, resulting in a large number of parameters, though 
simulated trajectories from SIR models can be used to 
approximate the population sizes within the partitions 
(Kuhnert et al. 2013). Finally, sequential Monte-Carlo meth- 
ods can be used to estimate parameters (Rasmussen et al. 
2011) but are much more computationally intensive than 
exact likelihood methods. 

By using an SIS model, which is the standard epidemiolog- 
ical model for sexually transmitted diseases without 
immunity (Anderson and May 1991), we can account for 
density-dependent transmission with only one additional 
parameter compared with density-independent (constant 
rate) birth-death models. This allowed us, for the first 
time, to estimate not only the transmission and removal 
rates but also the total susceptible population size of trans- 
mission groups from viral sequences. The parametrized SIS 
model can predict how the number of infected and suscep- 
tible individuals will vary over time. Thus, if an SIS model 
is a good representation of an ongoing epidemic, it is possible 
to make predictions of how the epidemic will continue to 
develop. 

Furthermore, we have shown that in principle, our method 
can estimate sampling and death rates independently in 
some cases, but that this signal is generally weak and esti- 
mates obtained without prior knowledge of the sampling 
probability must be carefully examined for potential biases. 
Generally, estimator bias for the DD model is small in those 
simulated data sets where the epidemic reached an endemic 
equilibrium well before all samples were taken, such that a 
large part of the samples are from the saturated phase. 

Our method is able to estimate parameters with higher 
accuracy compared with parametric coalescent-based infer- 
ence with deterministically changing population sizes. This is 
especially true when the epidemic has already reached the 
saturated phase. In the deterministic SIS model, the change in 
the number of infected individuals is described by a logistic 
function, which can be used to model the population size 
decline backward in time in a parametric coalescent (Griffiths 
and Tavare 1994). The parameters of the parametric coales- 
cent with logistic population decline depend on the effective 
population size at present, that is, the total number of 
infected individuals at present. In this context, an important 
difference between the stochastic SIS model and the deter- 
ministic SIS model becomes apparent. In the deterministic 
model, the infected population size asymptotically 
approaches the endemic equilibrium. In contrast, in the 



stochastic model, the process reaches a quasi steady-state, 
in which the infected population size fluctuates stochastically 
around an equilibrium value. This means that although the 
stochastic SIS model can account for processes that have 
reached a quasi steady-state, the deterministic model inter- 
prets small deviations from the asymptotic equilibrium as 
a signal of how long the epidemic has been in a state close 
to this equilibrium. This can lead to incorrect parameter 
estimates. 

We have demonstrated that using an epidemiological 
model together with phylogenetic inference can indeed 
lead to new insights into ongoing epidemics. Susceptible pop- 
ulation sizes in 8 out of 10 transmission clusters of the SHCS 
were estimated to be small. This is an indication that the 
subepidemics in these transmission clusters are characterized 
by an initial rapid spread that subsequently slows down 
after only a relatively small number of infections. Such a 
scenario is conceivable when the population is composed 
of many small susceptible subgroups, and transmission 
event between subgroups are much rarer than within 
subgroups. This is compatible with previous findings 
which showed that HIV epidemics are driven by heteroge- 
neous populations, such as heterogeneity in infection rates or 
heterogeneity in number of contacts (Liljeros et al. 2001; 
Kouyos et al. 2010; Leventhal et al. 2012; Stadler and 
Bonhoeffer 2013). 

Differences between transmission clusters are further 
reflected in estimates of the basic reproductive number, R 0 
(Anderson and May 1979). Because the estimate of the 
susceptible population size is small, the estimated R 0 is 
larger than previously reported for the density-independent 
case (Stadler et al. 2012). The reason for the previous under- 
estimation of R 0 is the following: The basic reproductive 
number is defined as the number of secondary infections 
caused by an infectious individual in a completely susceptible 
population. If the pool of susceptible individuals is very 
large, then R 0 is roughly equal to the number of secondary 
infections caused by any infectious individual, R n (t). The 
density-independent model assumes that R\(t) is constant 
throughout the whole epidemic, and it is the time-averaged 
value of R^t) which is estimated by the method. In the early 
stages of the epidemic, the number of infected individuals 
grows exponentially, such that R n (t) is roughly constant 
and approximately equal to R 0 . Thus, when the sampled 
sequences are confined to the early stages of the epidemic, 
then the estimate of using a density-independent 
model is an acceptable approximation for R 0 . As the 
number of susceptible individuals decreases later on in the 
epidemic, so does R^r) and consequently the time average 
of Rt(0 is an underestimation of R 0 . The new density- 
dependent model takes the decrease in R-|(t) over time 
into account, such that more accurate estimates of R 0 can 
be obtained when sequences are available from all periods 
of the epidemic. 

The underestimation of R 0 is most extreme when the 
cluster was saturated while most of the individuals were 
sampled (e.g., SHCS cluster 5: R° D = 13.7,R£° = 2.67). In 
this cluster, the initial spread proceeded rapidly, after which 
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only few infections happened (see fig. 2). LTT plots can be 
used to get a visual confirmation that the density-dependent 
model does indeed provide a better fit to the data. 
Alternatively, when most of the individuals are sampled 
from the initial phases of the epidemic, both the density-de- 
pendent and -independent models give similar estimates of 
R 0 and the number of lineages in the past (see fig. 2). The 
estimated total number of infected individuals at the present, 
however, will then be larger in the density-independent 
model than in the density-dependent model. 

The parameterized SIS model can be used to make predic- 
tions on how the epidemic is expected to progress within 
these clusters. This is in contrast to skyline-plot methods, 
which can only recreate the history of transmission rate 
changes in the past unless a time-dependent transmission 
model is fit to these skyline-plots. 

The accuracy of our model predictions will strongly 
depend on the validity of the assumptions. Here, we 
assume an SIS model where the rate of becoming noninfec- 
tious is equal to the recruitment rate of new susceptibles. This 
approximation is mainly performed for computational trac- 
tability, because the total population size, that is, / + S, 
remains constant over time. Although this assumption is 
clearly violated for diseases where the expected time to 
becoming noninfectious is much shorter than the recruit- 
ment time of new susceptibles (e.g., influenza), it is plausible 
when the two processes happen on a similar timescale. 

It is possible to extend our method to account for 
any kind of compartmental epidemiological models, such 
as a susceptible-infected-recovered (SIR) model. In its pre- 
sent form, however, this extension would come with a 
significant computational cost, because the number of 
recovered individuals would need to be tracked in addi- 
tion to the number of infected individuals. The number 
of differential equations that would then need to be 
solved increases from N to N 2 . In its present form, this 
would only be conceivable when the population size is 
moderate. 

The sampling probability, that is, the ratio of sampling rate 
to total death rate, r = \///(^ + fi), cannot be estimated 
using the density-independent model. In fact, it can be 
shown that for the density-independent model the likelihood 
function only depends on two out of the three parameters 
P,fx, and xjr, namely on the net growth rate (5 — [i — i/r and 
the product of transmission and sampling rate, f}%fr. For the 
density-dependent model, we could not show or disprove 
such a decrease in degrees of freedom of parameter space. 
However, we observed from our simulation study that it is 
possible to estimate the sampling probability for certain 
parameter combinations, though care must be taken to con- 
trol for potential biases as the signal for r is weak. In particular, 
the choice of r only has little effect on LTT plots predicted 
using the estimated parameters, meaning that different r 
explain the data equally well (recall that our data are 
essentially a LTT plot). Furthermore, the inferred cumulative 
number of sampled individuals is not significantly influenced 
by r, another indication that different r explain the data 
equally well. Thus, knowledge obtained from other data 



should be used to supply a prior probability on r. It is 
important to note that the choice of r does influence the 
quantitative value of R 0 as well as the estimated current 
number of infected individuals. 

Overall, we have presented a method that is readily 
available both as an R package and C++ stand-alone 
programs (Leventhal 2013). The method can be applied 
to transmission trees inferred from pathogen sequence 
data in order to obtain better estimates of epidemiological 
parameters such as R 0 , thus providing better insight into 
the transmission dynamics of SIS-type epidemics. 
Additionally, when information on the sampling probabil- 
ity is available from other data sources, reliable estimates 
of the size of the current infected population can be 
obtained. In summary, by applying our method to path- 
ogen sequence data, we can obtain a better understanding 
of the intensity of transmission within different transmis- 
sion clusters, which can help guide and assess public 
health intervention measures. 

Materials and Methods 

SIS Model 

A common way of modeling the spread of a disease through 
a population is with the SIS model (Kermack and McKendrick 
1927; Anderson and May 1991). Individuals can either be in 
a susceptible state S or an infected state /. In this article, 
we use a stochastic SIS model but motivate the choice of 
the model using a deterministic SIS model. 

Deterministic SI/SIS Model 

The change in number of susceptible and infected individuals 
over time can be written as a set of ODEs, 

j t = -0S//N + *(S,O, (6) 

j t =f}Sl/N-yl. (7) 

Here, /J/N is the infection rate per infectious contact and /x 
is the death/removal rate of infected individuals. The recruit- 
ment of new susceptible individuals is given by the arbitrary 
function ^(S,/). We assume that the total number of 
individuals in the population, N, is constant. In this case, 
| + ^ = 0, such that 0(S,/) = y\ and the SI model is 
equivalent to a SIS model (Anderson and May 1991). The 
force of infection A is the rate at which susceptible individuals 
become infected and is proportional to the number infected 
individuals in the population, 

A(l) ee pi/N. (8) 

Using S = N — /, we can rewrite equation (7), 

j t = G6(1 - l/N) - y)l. (9) 
Stochastic SIS Model 

We use a continuous-time Markov chain (CTMC) to model 
the epidemic process and the induced sampled phylogenetic 
trees. We define q ( (t) as the probability that / individuals 
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are infected at time t. The transition probabilities for the 
CTMC process are, 

Prob{/ -> I + 1} = (yS/N)(N - l)ldt, (10a) 

Prob{/ / - 1} = y/dt s (10b) 

Prob{/ /} = 1 - (/?(N - /)//N + y/)dt. (10c) 

We can write down the Kolmogorov forward differential 
equation for q ( (t), 

§=q ( _ 1 (t)^(N-(/-1))(/-1)/N 

at (11) 

+ q / + 1 (tM' + 1) - 0?(N - /)//N + y)q,(t). 

The solution to equation (11) gives us the probability 
of having / infected individuals at time t. The deterministic 
SIS model can be used as an upper bound for the expected 
value of the number of infected individuals at time t 
(Allen 2008), 

< 00(1 - E[/(t)]/N) - y)E[/(t)]. (12) 



Sampled Phylogenies of an Epidemic Outbreak 
In surveillance data, information about the infectious state 
is only available for a subset of individuals. We assume 
that throughout the course of the epidemic, the infected 
individuals are sampled at a constant rate x//. Once they 
are sampled, we assume that these individuals can no 
longer infect anyone else and are removed. This is an appro- 
priate assumption for many diseases where sampling is 
usually linked to drug treatment, isolation, or behavior 
change, after which transmission becomes unlikely (e.g., 
HIV) or recovery is rapid. The sampled transmission tree 
(sampled phytogeny) results from disregarding all non- 
sampled individuals from the complete transmission tree 
(fig. 1). As we assume that sampled individuals are no 
longer infectious, the removal rate y in equations (9) and 
(10b) becomes y = fi + i/r, where /x is the removal rate with- 
out sampling and ifr is the removal rate with sampling. 
We define r as the probability of being sampled upon 
removal, r = tfr/({i + \jr). 

Inferring Epidemiological Parameters from Sampled 
Phylogenies 

Our aim is to infer the parameters of the stochastic SIS model, 
y6,/x,i/f,N, defined by equation (11), based on a sampled phy- 
logeny, T. In the supplementary information, Supplementary 
Material online, we derive the likelihood C(T; 9) that an SIS 
model with parameters 9 = (P,fj.,xfr,N) gave rise to the 
sampled phylogeny. 

Parameter inference 

We use the likelihood function to obtain parameter 
estimates in two ways: (A) a ML framework; (B) by estimating 
the posterior distribution of parameters in a Bayesian 
framework. In the Bayesian framework, we perform a 



Metropolis-Hastings (MH) MCMC estimation of the joint 
likelihood of all the sampled phylogenies to obtain a posterior 
distribution of the parameters (Metropolis et al. 1953; 
Hastings 1970). Let T = [T^T 2 , . . . ,T m ] be a set of iid 
samples chosen from the distribution P(T | 9) of sampled 
phylogenies. The probability density of a tree is the likelihood 
of the tree from the New Approaches section. Because all the 
trees are assumed to be iid, the likelihood of T is the product 
of the likelihoods of the individual trees, 

m 

C(J;0) = \\C(T l ;0). (13) 

i=i 

The full conditionals are not known, and we need to 
resort to a MH approach to sample from the posterior 
distribution of £. Furthermore, the parameters N,/3,fi, and 
\[r are highly correlated, which greatly increases the time 
to convergence of the MCMC chain when using 
traditional MH or sequential Gibbs sampling. We thus 
use Differential Evolution Adaptive Metropolis (DREAM) 
to estimate the posterior density (Vrugt et al. 2009). 
Convergence in the DREAM scheme is determined via the 
Gelman-Rubin convergence diagnostic and is reached when 
the scale reduction factor R c < 1.05 for all parameters 
(Brooks and Gelman 1998). 

Relative Bias 

To determine how well our method estimates the epidemic 
parameters, we look at the relative bias, 

m=^> 04) 

where 9; is the true value of the i-th parameter 9 = (N,/3, 
/x,i/f) and 9j is the mean of the estimated posterior. 

Supplementary Material 

Supplementary methods, figures S1-S4, and tables S1-S8 are 
available at Molecular Biology and Evolution online (http:// 
www.mbe.oxfordjournals.org/). 
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